Abstract

Cryptomonads are a lineage of unicellular and mostly photosynthetic algae that acquired their plastids through
the "secondary" endosymbiosis of a red alga and still retain the nuclear genome (nucleomorph) of the latter. We
find that the genome of the cryptomonad Guillardia theta comprises genes coding for 13 globin domains, of which 
6 occur within two large chimeric proteins. All the sequences adhere to the vertebrate 3/3 myoglobin fold. 
Although several globins have no introns, the remainder have atypical intron locations. Bayesian phylogenetic 
analyses suggest that the G. theta Hbs are related to the stramenopile and chlorophyte single domain globins. 
Reviewers: This article was reviewed by Purificacion Lopez-Garcia and Igor B Rogozin. 
Findings

Endosymbiosis is a fundamental process that has shaped the diversity and evolution of
unicellular eukaryotes [1]. "Primary endosymbiosis", the uptake of a photosynthetic
cyanobacterium by an early nonphotosynthetic unicellular eukaryote, gave rise to the double
membrane-bound plastids of glaucophytes, red and green algae, and land plants [1,2].
Subsequent endosymbioses of primary-plastid-containing eukaryotes by nonphotosynthetic hosts,
called "secondary endosymbiosis", gave rise to much of the present-day diversity of protists [3].
Although reduction of the endosymbiont nucleus has been completed in most protist lineages, a
remnant nucleus, the nucleomorph, is still present in two protist lineages, the cryptomonads
and the chlorarachniophytes [4]. The nucleomorphs of these two groups have independent
origins: the cryptomonad plastid and nucleomorph are of red algal ancestry [5,6], whereas the
chlorarachniophyte plastids and nucleomorphs are of green algal origin [7,8]. Several genomes
from these two lineages have been sequenced, including those of Guillardia theta, Bigelowiella
natans, Hemiselmis andersenii and Cryptomonas paramecium [7-10]. Here, we report on the
presence of hemoglobin genes in the host nuclear genome of G. theta and on the relationship of
the sequences they encode to protist and animal globins.

Using the assignments listed on the SUPERFAMILY web site (http://supfam.org), which are based
on a library of hidden Markov models [11,12], we find 13 globin domains in 9 proteins encoded
in the nuclear genome of Guillardia theta. All the sequences were subjected to a FUGUE search
[13] (www-cryst.bioc.cam.ac.uk), a stringent test of whether a given sequence is a globin
[14,15]. Our criteria for accepting a sequence as a true globin are the following: a FUGUE
Z score >6 (corresponding to 99% probability), the occurrence of a His residue at position F8
and proper alignment of helices BC through G, satisfying the myoglobin fold [16]. Although 7
domains occur in single domain (SD) globins, 6 occur within 2 large (>1000 residues) chimeric
proteins, of 1060 residues (EKX33177.1) and 1497 residues (EKX39126.1), both of which appear
to have a putative cytochrome b5 domain at the N-terminus. The 13 globin domains exhibit
identity scores ranging from 7 to 70% (Additional file 1). A MAFFT alignment [17] of the 13
domains with sperm whale Mb is shown in Additional file 2. All the G. theta Hbs have a His at
the proximal position F8, except for the 275-residue globin (EKX43967.1). Interestingly, the
latter contains a potential myristoylation site predicted with very high probability, a
post-translational modification observed in several vertebrate and invertebrate globins
[18,19]. At the distal position E7, the majority of the residues are hydrophobic, and at
position CD1 all globins contain a Phe. Furthermore, the globin domain D3 of the 1060-residue
chimeric protein (EKX33177.1) appears to lack the H helix. Thus, apart from these two defective
sequences, all the observed G. theta globins appear to be fully functional, and their alignment
with the sperm whale Mb sequence (Additional file 2) clearly demonstrates their adherence to
the canonical Mb-fold [16].
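
The three acceptance criteria listed above can be written as a simple filter. The following is a minimal sketch only: the record type, its field names and the example values are hypothetical and do not correspond to actual FUGUE output or to any sequence in this study; only the Z score cutoff of 6 and the F8 His requirement are taken from the text.

```python
from dataclasses import dataclass

Z_CUTOFF = 6.0  # threshold stated in the text: FUGUE Z score > 6


@dataclass
class GlobinCandidate:
    """Hypothetical summary of one candidate domain (illustrative fields)."""
    name: str
    fugue_z: float         # FUGUE Z score for the globin match
    f8_residue: str        # one-letter residue aligned to proximal position F8
    helices_aligned: bool  # True if helices BC through G align to the Mb fold


def failed_criteria(candidate):
    """Return the names of the criteria the candidate fails (empty if none)."""
    failures = []
    if candidate.fugue_z <= Z_CUTOFF:
        failures.append("fugue_z")
    if candidate.f8_residue.upper() != "H":
        failures.append("f8_his")
    if not candidate.helices_aligned:
        failures.append("helix_alignment")
    return failures


def is_true_globin(candidate):
    """A candidate is accepted only if it passes all three criteria."""
    return not failed_criteria(candidate)


if __name__ == "__main__":
    # Invented example values, for illustration only.
    for cand in (
        GlobinCandidate("candidate_A", 15.2, "H", True),
        GlobinCandidate("candidate_B", 9.8, "Q", True),
    ):
        print(cand.name, is_true_globin(cand), failed_criteria(cand))
```

Reporting which criterion failed, rather than a bare yes/no, makes it easy to flag sequences such as a domain lacking the F8 His as defective while still retaining them for alignment.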

A Bayesian phylogenetic analysis [20] of a MAFFT L-INS-i alignment [17] of the 13 G. theta
globin domains provided the unrooted tree shown in Figure 1A. The domains form 3 separate
clusters: the small SD globins (EKX39152.1, EKX39124.1, EKX33440.1, EKX33112.1 and EKX46654.1)
group with the D2 domains of the two chimeric proteins, the D1 and D3 domains form another
cluster, and the defective globin lacking the F8 His (EKX43967.1) occurs as an outlier.
Figure 1B depicts the phylogenetic tree resulting from a Bayesian analysis of a multiple
sequence alignment (MSA) of G. theta Hbs representative of the three clusters observed in
Figure 1A and sequences representing 8 vertebrate globin families (Ngbs, Cygbs, GbX, GbY, GbE,
Mb, HbA and HbB), 2 cyclostome Hbs, 5 choanoflagellate Hbs as well as 26 protist sequences,
including chlorophytes, haptophytes, stramenopiles, rhodophytes, alveolates, ichthyosporeans
and filastereans. We employed Clustal Omega [21] for the MSA and GUIDANCE [22] to assess the
quality of the MSA and improve it via removal of low-scoring columns. We used as outgroup
either the 2 Bacillus nonheme globins [23] or plant 3/3 Hbs, including one LegHb and 2 NsHbs.
Although the Bacillus nonheme globins have the 3/3 Mb fold, their heme binding cavity is
defective due to wider separation of helices. Consequently, we think that they represent the
optimal outgroup for globin phylogeny. The sequences used in our analysis are provided in
Additional file 3. The animal and protist sequences are widely separated, with the G. theta
Hbs clustering together, surrounded by stramenopile, rhodophyte and chlorophyte sequences. A
major concern in globin phylogeny is the poor statistical support for the nodes occurring in
phylogenetic trees, irrespective of the MSAs employed, the type of phylogenetic analysis and
the evolutionary models used. On one hand, the globin sequences are relatively short, and on
the other, the low identities found in pairwise alignments of distantly related globins result
in the masking of phylogenetic signals coded within the sequences. We sought to extend the
aforementioned result by performing additional MSAs and additional molecular phylogenetic
analyses, including Maximum Likelihood (ML) using MEGA version 5.2 [24]. The result of an ML
analysis of the same set of sequences as in Figure 1B, aligned using Clustal Omega and shown in
Additional file 4, is in broad agreement with the Bayesian tree, despite low bootstrap support.
We also show Bayesian trees of MSAs using the same sequences as in Figure 1B and Additional
file 3, based on MAFFT L-INS-i (Additional file 5) and MUSCLE [25] (Additional file 6). They
are in broad agreement with the Bayesian tree based on the Clustal Omega MSA (Figure 1B). The
G. theta Hbs again cluster together with several stramenopile Hbs. Furthermore, all Bayesian
trees reproduce the known phylogenetic relationships between the vertebrate globins,
determined by J. Storz and colleagues [26-28]. A diagram of the latter is provided as
Additional file 7.
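
The step of improving an MSA by removing low-scoring columns can be illustrated with a short sketch. It assumes that a per-column confidence score is already available (for example, from an alignment-reliability tool) and that the alignment is held as a mapping from sequence name to gapped sequence; the function name, the toy alignment, the scores and the cutoff are all hypothetical and are not taken from the analyses reported here.

```python
def filter_columns(alignment, column_scores, cutoff):
    """Keep only alignment columns whose score is >= cutoff.

    alignment     -- dict mapping sequence name to gapped sequence (equal lengths)
    column_scores -- one confidence score per alignment column
    cutoff        -- columns scoring below this value are removed
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must be non-empty and of equal length")
    (n_columns,) = lengths
    if len(column_scores) != n_columns:
        raise ValueError("need exactly one score per alignment column")

    kept = [i for i, score in enumerate(column_scores) if score >= cutoff]
    return {
        name: "".join(seq[i] for i in kept)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    # Toy alignment and invented scores, for illustration only.
    toy = {"seq1": "MK-LV", "seq2": "MKALV", "seq3": "M-ALI"}
    scores = [0.99, 0.40, 0.35, 0.97, 0.95]
    print(filter_columns(toy, scores, cutoff=0.9))
```

Because globin sequences are short, aggressive column removal trades alignment reliability against an already limited number of informative sites, so the cutoff is a judgment call rather than a fixed constant.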

Although the genes coding for the numerous G. theta Hbs are part of the nuclear and not the
nucleomorph genome, their very presence is remarkable in a unicellular organism that has
undergone two endosymbiotic events and the succeeding, extensive reductions and processing of
the plastid and nucleomorph genomes. The locations of introns in G. theta Hbs are also
unusual. In vertebrates, intron locations are fairly constant, with two introns inserted at
conserved positions B12.2 (intron located between codon positions 2 and 3 of the 12th amino
acid of globin helix B) and G7.0 [29]. In contrast, intron locations appear to be highly
variable in protostome phyla, e.g. in nematodes [30-32] and in Chironomus [33]. In the case of
G. theta Hbs, we find that introns are absent in 9 out of 13 globin genes. The remaining
globins contain intron insertion sites at atypical positions, such as B2.0, E1.1, E15.0 and
F8.0 for EKX43967.1 (Additional file 2). The D3 globin domains in the two chimeric,
multidomain proteins (EKX33177.1 and EKX39126.1) are interrupted by 4 introns: they share the
C1.0 and interhelical EF7.0 insertion positions, combined with intron insertions at B4.0 and
at E6.2 for EKX33177.1, and at G19.0 and H11.0 for EKX39126.1. Absence of introns has also been
observed in 3/3 globins from the Archaeplastida genomes of C. merolae, O. tauri, M. pusilla
and M. sp. RCC299 [34].
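
The intron position labels used above combine a helix or interhelical segment, a residue number within it, and an intron phase. As a hedged illustration of this notation, the sketch below parses such labels; the type and function names are hypothetical, and the list of allowed segment names (A through H plus the interhelical AB, BC, CD, EF, FG and GH) is an assumption based on the standard globin helix nomenclature. Phase 2 is described as in the text (between codon positions 2 and 3); phases 0 and 1 follow the usual definition of intron phase.

```python
import re
from typing import NamedTuple


class IntronSite(NamedTuple):
    segment: str  # helix or interhelical segment, e.g. "B" or "EF"
    residue: int  # residue number within that segment
    phase: int    # intron phase: 0, 1 or 2


# Two-letter interhelical segments are listed before single helices.
_LABEL = re.compile(r"^(AB|BC|CD|EF|FG|GH|[A-H])(\d+)\.([012])$")

_PHASE_TEXT = {
    0: "at a codon boundary",
    1: "between codon positions 1 and 2",
    2: "between codon positions 2 and 3",
}


def parse_intron_site(label):
    """Parse a label such as 'B12.2' into segment, residue number and phase."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid intron position label: {label!r}")
    segment, residue, phase = match.groups()
    return IntronSite(segment, int(residue), int(phase))


def describe(site):
    """Return a plain-language description of an intron site."""
    return (
        f"segment {site.segment}, residue {site.residue}: "
        f"intron {_PHASE_TEXT[site.phase]} (phase {site.phase})"
    )


if __name__ == "__main__":
    for label in ("B12.2", "G7.0", "EF7.0"):
        print(label, "->", describe(parse_intron_site(label)))
```

Parsing labels into structured records makes it straightforward to compare insertion sites across sequences, for instance to find the positions shared by two domains.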

The presence of several globins in the cryptomonad G. theta adds yet another puzzle to the
search for an explanation of the physiological role of Hbs in microbial eukaryotes. The
possible functions of Hbs in bacteria, fungi and protists have been reviewed recently (see
references [35-37]). The G. theta Hbs are 3/3 globins related to the FHbs and the related
single domain globins found in bacteria [37] and protists [35]. Although the functions of
metazoan Hbs vary widely, from oxygen transport and storage to enzymatic activities [38], the
latter are more likely in bacteria and in microbial eukaryotes [35]. Our molecular
phylogenetic analyses suggest that the G. theta Hbs are related to the stramenopile and
chlorophyte single domain globins.

Reviewers' comments 

Reviewer's report 1: Purificacion Lopez-Garcia
(Centre National de la Recherche Scientifique, France) 

This discovery note describes the presence of hemoglobin genes in the genome of a cryptomonad
species and presents a phylogenetic analysis of those sequences. The observation of hemoglobin
genes in cryptomonads may have some interest, although it does not come as a surprise taking
into account that globin genes seem universally distributed and have already been detected in
very distant eukaryotic lineages, including not only plants and animals but also various
divergent protist groups. However, the phylogenetic analyses presented shed little light on
the origin and the evolution of cryptophyte globins. Indeed, despite the laudable efforts made
by the authors to extract some phylogenetic information from these genes, the phylogenetic
trees are not well resolved and many nodes are not supported. This means that the remaining
phylogenetic signal in these genes is low and, unfortunately, of little use for discussing
globin evolution in eukaryotes.

Figure 1 Phylogeny of G. theta Hbs. (A) Bayesian phylogenetic tree of the 13 globin domains of G. theta aligned using MAFFT L-INS-i MSA.
Bayesian phylogenetic reconstruction was performed by MrBayes 3.2.2 employing a mixed substitution model. MCMCMC sampling was carried
out using 2 independent runs for 1'000'000 generations on the CIPRES web portal [39]. Support values at branches represent Bayesian posterior
probabilities (>0.5). (B) Bayesian phylogenetic tree based on a Clustal Omega MSA of G. theta Hbs with representative vertebrate, protist and
choanoflagellate Hbs, using the two Bacillus nonheme globins as outgroup [23]. Bayesian phylogenetic reconstruction was performed by MrBayes
3.2.2 employing a mixed substitution model. MCMCMC sampling was carried out using 2 independent runs for 10'000'000 generations on the
CIPRES web portal [39]. All globin sequences are identified by full species name, the number of residues, the abbreviated phylum and family
names, and their identification numbers. Support values at branches represent Bayesian posterior probabilities (>0.5). Abbreviations of protist
taxa: ALV - Alveolate; AMOE - Amoebozoa; CHOA - Choanoflagellate; CHL - Chlorophyte; FIL - Filasterea; ICH - Ichthyosporea;
HAP - Haptophyte; RHO - Rhodophyte; STR - Stramenopile.

My major problem with this note is that, despite the very limited phylogenetic information
carried by these genes, the authors try to draw meaningful conclusions from it. Unfortunately,
apart from being globin homologs, little more can be said from a phylogenetic perspective.
Hence, some affirmations seem out of place or meaningless; for instance:

- Abstract and last sentence of manuscript: 
"phylogenetic analyses suggest that the G. theta Hbs 
are related to the globin lineage that gave rise to 
chlorophyte and land plant Hbs and to animal 
globins, including vertebrate neuroglobins". Plants 
and animals belong to two extremely distant 
eukaryotic lineages, so that this equals saying that G. 
theta Hbs are eukaryotic. This does not provide any 
information as to the evolution of cryptomonad Hbs 
within eukaryotes or as to their proximity to other 
eukaryotic groups. 

Author's response: We have altered the last sentences of 
the Abstract and the manuscript, to reflect the only limited 
conclusion we can make, namely that the cryptomonad 
Hbs cluster with chlorophyte and stramenopile Hbs. 

- Page 5 of Findings: "This result illustrates the
positive aspect of the available globin phylogenetic
trees, namely a consistent and reproducible
clustering of certain groups of sequences, despite the
typically less than robust statistical node support".
The authors point here to the main problem of
their analysis, the poor statistical support, but at the
same time, they would like to believe that the
clustering is robust, because it seems reproducible.
However, the occurrence of low statistical support
is incompatible with solid clustering. Values of 0.5
or 0.6 that can be seen in many nodes of the tree
shown in Figure 1 would imply that in around 40 to
50% of cases, that node is not recovered by the
phylogenetic analysis.

Author's response: We agree with the reviewer's criticism, since we point it out ourselves.
We have rewritten the relevant section in Findings. Based on the suggestion by the second
reviewer, we have performed new Bayesian analyses with additional outgroup sequences, using
different MSAs and the corresponding GUIDANCE evaluations. The new tree shown in Figure 1B,
based on a Clustal Omega MSA, contains reasonable Bayesian posterior probability support
values for the clustering of G. theta globins with chlorophyte and stramenopile Hbs, as well
as convincing support for the lower clade containing all stramenopile, cryptomonad,
amoebozoan, filasterean, ichthyosporean and rhodophyte globins used in the analysis.
Furthermore, the additional results based on MAFFT and MUSCLE MSAs with bacterial nonheme
globin sequences as outgroup are in very good agreement with each other, despite the
statistical shortcomings. This is as good a result as can be obtained with single MSA trees of
highly divergent and short globin proteins from more basal organisms. Although we agree with
the reviewer that discussion of phylogenetic relationships between animal neuroglobins, plant
and protist Hbs is not appropriate based on our results, we believe that a limited conclusion
about the clustering of the G. theta Hbs with chlorophyte and stramenopile Hbs is acceptable.

The authors must provide a tree with all the complete 
species names. The phylogenetic tree shown is very diffi- 
cult to follow and the abbreviations for species that the 
authors use are not given. They may want to color them 
as a function of their taxonomic group, instead of (or in 
addition to) adding phyla letter codes. 

Author's response: We have altered the presentation of 
the tree according to the suggestion by the reviewer to im- 
prove its readability. Protist taxa were given a color code 
and species names were written in full in Figure 1. All 
species name abbreviations used in the supplemental 
trees were included in supplemental file 3. In all supple- 
mental trees the color code was adapted accordingly. 

Reviewer's report 2: Igor B Rogozin (NIH, United States of 
America) 

The authors found 13 globin domains in the genome of 
the cryptomonad Guillardia theta. Several globins have 
atypical intron locations. These are interesting findings. 

I am not sure that the tree shown in Figure 1B is indeed a correctly rooted tree. The authors
claimed that they use "Three plant 3/3 Hbs, one LegHb and 2 NsHbs ... as outgroup". I cannot
conclude that this is the correct outgroup; this is my impression when I look at the tree
(Figure 1B). Thus any conclusions about phylogenetic clustering/nesting may not be supported
by this tree. The authors may try some other outgroups in order to confirm that this is the
correct rooting and/or provide strong support that they used correct outgroups. Outgrouping
is a problem; see, for example,
https://www.blackwellpublishing.com/ridley/tutorials/The_reconstruction_of_phylogeny13.asp.

Author's response: We appreciate the reviewer's concern and agree that rooting might be a
problem. Consequently, we have sought sequences other than plant Hbs to use for rooting. In
the revised manuscript, we show in Figure 1B a Clustal Omega MSA rooted with the nonheme
globin sequences from Bacillus. Although the sequences have structures that exhibit a 3/3 Mb
fold, the heme binding cavity is defective due to the pulling apart of helical strands. We
also use the Bacillus sequences to root MAFFT and MUSCLE MSAs of the same set of sequences as
used in Figure 1B. The resulting Bayesian trees are now Additional files 5 and 6 in the
revised manuscript.

Additional files

Additional file 1: Similarity matrix based on a MAFFT MSA of the G.
theta Hbs.

Additional file 2: A MAFFT alignment of the 13 globin domains of
G. theta with sperm whale Mb. The Mb fold template consists of
predominantly hydrophobic residues at 37 positions, defining helices A
through H: A8, A11, A12, A15, B6, B9, B10, B13, B14, C5, CD1, CD4, E4, E7,
E8, E11, E12, E15, E18, E19, F1, F4, F8, FG4, G5, G8, G11, G12, G13, G15,
G16, H7, H8, H11, H12, H15 and H19. Although the proximal residue at
position F8 (P) is always His, the distal residue (D) at position E7 is mostly
Met. No introns were observed in Guithe_107_EKX33112.1,
Guithe_110_EKX33440.1, Guithe_122_EKX39152.1,
Guithe_126_EKX39152.1, Guithe_126_EKX39124.1,
Guithe_126_EKX46654.1 and Guithe_211_EG728842.1. The intron
locations in the remaining sequences are variable, marked in green for
phase 0, yellow for phase 1 and red for phase 2.

Additional file 3: The sequences of the G. theta Hbs and other
plant, protist and metazoan Hbs used in the phylogenetic analyses.

Additional file 4: Maximum likelihood tree of the Clustal Omega
MSA of G. theta Hbs with representative vertebrate, protist,
choanoflagellate Hbs and plant Hbs using two Bacillus nonheme
globins as outgroup. ML analysis was performed by MEGA 5.2 under a
WAG substitution model. The resulting tree was tested by bootstrapping
with 100 replicates. Same sequences as in Figure 1B and Additional file 3.
All globin sequences are identified by the first three letters of the genus
name and the first three letters of the species name, the number of
residues, the abbreviated phylum and family names, and their
identification numbers. Support values at branches represent bootstrap
percentages (>50) of ML analysis. Abbreviations of protist taxa:
ALV - Alveolate; AMOE - Amoebozoa; CHOA - Choanoflagellates;
CHL - Chlorophyte; FIL - Filasterea; ICH - Ichthyosporea; HAP - Haptophyte;
RHO - Rhodophyte; STR - Stramenopile.

Additional file 5: Bayesian phylogenetic tree based on a MAFFT
MSA, of G. theta Hbs with representative vertebrate, protist,
choanoflagellate Hbs and plant Hbs using two Bacillus nonheme
globins as outgroup. Bayesian phylogenetic reconstruction was
performed by MrBayes 3.2.2 employing a mixed substitution model.
MCMCMC sampling was carried out using 2 independent runs for
10'000'000 generations on the CIPRES web portal [39]. All globin sequences are
identified by the first three letters of the genus name and the first three
letters of the species name, the number of residues, the abbreviated
phylum and family names, and their identification numbers. Support
values at branches represent Bayesian posterior probabilities (>0.5).
Abbreviations of protist taxa: ALV - Alveolate; AMOE - Amoebozoa;
CHOA - Choanoflagellates; CHL - Chlorophyte; FIL - Filasterea;
ICH - Ichthyosporea; HAP - Haptophyte; RHO - Rhodophyte;
STR - Stramenopile.

Additional file 6: Bayesian phylogenetic tree based on a MUSCLE
MSA, of G. theta Hbs with representative vertebrate, protist,
choanoflagellate Hbs and plant Hbs using two Bacillus nonheme
globins as outgroup. Bayesian phylogenetic reconstruction was
performed by MrBayes 3.2.2 employing a mixed substitution model.
MCMCMC sampling was carried out using 2 independent runs for
10'000'000 generations on the CIPRES web portal [39]. All globin sequences are
identified by the first three letters of the genus name and the first three
letters of the species name, the number of residues and the abbreviated
phylum and family names. Support values at branches represent Bayesian
posterior probabilities (>0.5). Abbreviations of protist taxa:
ALV - Alveolate; AMOE - Amoebozoa; CHOA - Choanoflagellates;
CHL - Chlorophyte; FIL - Filasterea; ICH - Ichthyosporea;
HAP - Haptophyte; RHO - Rhodophyte; STR - Stramenopile.

Additional file 7: A diagrammatic representation of the phylogeny
of vertebrate globins. Figure one from ref [28].

Abbreviations

Cygb: Cytoglobin; GbE: Globin E, an eye-specific avian globin; GbX: Globin X,
found in fish, amphibians and gnathostomes; GbY: Globin Y found in
amphibians; Hb: Hemoglobin; HGT: Horizontal gene transfer; LECA: Last
Universal Eukaryote common ancestor; LegHb: Leghemoglobin;
Mb: Myoglobin; ML: Maximum likelihood; MSA: Multiple sequence alignment;
Ngb: Neuroglobin; NsHb: Nonsymbiotic plant Hb; SDgb: Single domain 3/3
globin related to the N-terminal of FHbs.